Researcher perspectives on publication and peer review of data


Article:
Kratz JE, Strasser C (2015) Researcher Perspectives on Publication and Peer Review of Data. PLoS ONE 10(2): e0117619. doi:10.1371/journal.pone.0117619

Data:
Kratz, JE; Strasser, C (2014) Data from Researcher perspectives on publication and peer review of data. UC Office of the President. doi:10.5060/d8rp4v

This notebook:
Kratz, JE; Strasser, C (2014) Code from Researcher perspectives on publication and peer review of data. Zenodo. doi:10.5281/zenodo.16102


Abstract


Data "publication" seeks to appropriate the prestige of authorship in the peer-reviewed literature to reward researchers who create useful and well-documented datasets. The scholarly communication community has embraced data publication as an incentive to document and share data. But, numerous new and ongoing experiments in implementation have not yet resolved what a data publication should be, when data should be peer-reviewed, or how data peer review should work. While researchers have been surveyed extensively regarding data management and sharing, their perceptions and expectations of data publication are largely unknown. To bring this important yet neglected perspective into the conversation, we surveyed ~250 researchers across the sciences and social sciences-- asking what expectations "data publication" raises and what features would be useful to evaluate the trustworthiness, evaluate the impact, and enhance the prestige of a data publication. We found that researcher expectations of data publication center on availability, generally through an open database or repository. Few respondents expected published data to be peer-reviewed, but peer-reviewed data enjoyed much greater trust and prestige. The importance of adequate metadata was acknowledged, in that almost all respondents expected data peer review to include evaluation of the data's documentation. Formal citation in the reference list was affirmed by most respondents as the proper way to credit dataset creators. Citation count was viewed as the most useful measure of impact, but download count was seen as nearly as valuable. These results offer practical guidance for data publishers seeking to meet researcher expectations and enhance the value of published data.

Introduction

Data sharing

In 1985– almost 30 years ago– Stephen Ceci surveyed 847 scientists and concluded “it is clear that scientists in all fields endorse the principle of data sharing as a desirable norm of science” (Ceci, 1988). This endorsement has not weakened over the decades; more than 65% of faculty at California Polytechnic State University (Cal Poly) affirmed the importance of data sharing in 2010 (Scaramozzino et al., 2012), as did 94% of the researchers in the United Kingdom (UK) surveyed by the Expert Advisory Group on Data Access (EAGDA) in 2013 (Bobrow et al., 2014). The respondents to the 1985 survey endorsed data sharing “to allow replication and extension of one’s own findings” (Ceci, 1988), and enabling replication and (re)use is still the principal motive behind data sharing. The reproducibility problem plaguing science, documented in both the scholarly (Ioannidis, 2005; Prinz et al., 2011; Mobley et al., 2013) and mainstream (Zimmer, 2012; Economist 2013a, Economist 2013b) press, could be addressed, in part, by opening underlying data to scrutiny (Drew et al., 2013; Collins & Tabak, 2014). Beyond confirming previous analyses, reuse of existing data cuts research costs (Piwowar et al., 2011) and allows new questions to be addressed (Stewart, 2010; Borenstein et al., 2011). Despite the apparent enthusiasm for data sharing in principle, Ceci alleged that in practice, “something is amiss in the academy.”

Researchers frequently fail to make data available, even when they support the idea or are obliged to do so. Alsheikh-Ali et al. examined 351 articles and found that 59% did not satisfy the data availability requirements of the journal that published them (Alsheikh-Ali, 2011). Vines et al. requested data from 516 articles published between 1991 and 2011 and obtained it less than half (47%) of the time (Vines et al., 2014). Researchers themselves agree that this is a problem. In 1985, 59% of scientists surveyed by Ceci complained that their colleagues were disinclined to share data (Ceci, 1988). Twenty-five years later, 67% of respondents to an international survey by the Data Observation Network for Earth (DataONE) affirmed that “[l]ack of access to data generated by other researchers or institutions is a major impediment to progress in science” and 50% felt that their own research had suffered (Tenopir et al., 2011). That same year, fewer than half of the 65% of Cal Poly faculty who agreed that data sharing is important followed through to share their own data (Scaramozzino et al., 2012).

Why do researchers who believe in the importance of sharing data fail to carry through? Previous surveys unearthed a number of reasons: concern about the ethical and legal issues around human subject data, mistrust that others have the expertise to use the data appropriately, hope of wringing additional articles from the data, and fear that the data will be “stolen” without credit or acknowledgment. For example, researchers brought up ethical concerns in reports from the Research Information Network (RIN) in 2008 and EAGDA in 2014; in the 2014 report, this was the second most frequently mentioned constraint (by 55% of respondents) (Swan & Brown, 2008; Bobrow et al., 2014). The risk of losing publications from premature sharing came up in 60% of a series of interviews of United States (US) scientists in 2012, more than any other risk; fear of “data theft” was mentioned in 32% of the interviews (Kim & Stanton, 2012). However, by far the most consistent reason given is that preparing and documenting data to a high enough standard to be useful just takes too much time.

In the UK, the RIN report described lack of time as a major constraint (Swan & Brown, 2008), and it was mentioned by 66% of respondents to EAGDA, more than any other constraint (Bobrow et al., 2014). Time was brought up by 44% of respondents to Kim and Stanton’s survey (Kim & Stanton, 2012), more than any other cost. It was the most frequent reason for not sharing data in the multidisciplinary DataONE survey (named by 54% of respondents) (Tenopir et al., 2011) and the second most frequent in a follow-up survey of astrobiologists (named by 22%) (Aydinoglu et al., 2014). Time investment was the second most frequently raised objection to data sharing in a 2012 survey of biodiversity researchers and the “most violently discussed obstacle” in associated interviews (Enke et al., 2012).

Although the process of preparing and documenting data for sharing could undoubtedly be streamlined with better planning, education, and tools, it will always take time and effort. The underlying problem is that this time and effort is not rewarded. Lack of acknowledgment was the third most frequently raised objection in the biodiversity survey, and $\sim$2/3 of respondents said they would be more likely to share if they were recognized or credited when their data was used. In the EAGDA report, 55% of respondents said that lack of tangible recognition and rewards constrains data sharing, and at least 75% felt that, to some or a great extent, the UK Research Excellence Framework (REF) fails to recognize data relative to publications– but that it should. The need to compensate researchers who share data with scholarly prestige is a major driver of the movement toward data publication.

Data Publication

Data publication appropriates familiar terminology ("publication," "peer review") from the scholarly literature in order to insinuate data into the existing academic reward system (Costello, 2009; Lawrence et al., 2011; Atici et al., 2012). The model of data publication that most closely mimics the existing literature is the data paper. Data papers describe datasets, including the rationale and collection methods, without offering any analysis or conclusions (Newman & Cork, 2009; Callaghan et al., 2013). Data papers appear in existing journals like F1000Research and Internet Archaeology as well as new dedicated journals such as Earth System Science Data, Geoscience Data Journal (Allan, 2014), and Nature Publishing Group's Scientific Data– which describes itself concisely as "a publication venue that credits scientists who share and explain their data". Data papers are invariably peer-reviewed based on the dataset, its description, and whether the two form a complete, consistent, and usable package (Lawrence et al., 2011). The appeal of data papers is straightforward: they are unquestionably peer-reviewed papers, so academia knows how (if perhaps not how much) to value them.

However, other data-publishing approaches abound. Data publishers include repositories such as Dryad, figshare, and Zenodo where researchers can self-deposit any kind of research data with light documentation requirements and minimal validation. Dryad requires that data be associated with a "reputable" publication, while figshare and Zenodo are completely open. Domain-specific repositories frequently have more stringent documentation requirements and access to the domain knowledge needed for thorough evaluation. For instance, the National Snow and Ice Data Center (NSIDC) evaluates incoming data in a complex process involving both internal reviewers with technical expertise and external peers with domain knowledge (Weaver & Duerr, 2012). As a final example, Open Context publishes carefully processed and richly annotated archaeology data, some of which passes through editorial and peer review (Kansa & Kansa, 2012). One thing that all of these publishers have in common is that they endeavor to make datasets formally citable (in part through assignment of stable identifiers) as a means to credit the creators.

The variety of forms of data publication attests to a general lack of consensus on what, exactly, it means to publish data. Noting a lack of both consensus and interest on the part of researchers, the RIN report of 2008 adopted a deliberately minimal definition: "making datasets publicly available" (Swan & Brown, 2008). While eminently practical, this definition does not do much to distinguish publication from sharing (except for ruling out certain channels) or to advance its prestige. More recently, Callaghan et al. (2013) suggested distinguishing between published (available) and Published (also citable and peer-reviewed) data. We have argued that the consensus in the scholarly communication community– publishers, librarians, curators– is that published data is openly available, documented, and citable, but that what kind of validation (if any) is required to qualify is still an open question (Kratz & Strasser, 2014). It is easy to forget, however, that what data publication means to the scholarly communication community is substantially irrelevant. The point of calling data made public through whatever particular process "published" is to exploit the meaning of the word to researchers; the important definition is what data publication means to them. This paper surveys researchers to explore these semantic gaps between the scholarly communication community and researchers and to make researcher expectations explicit, so that data publishers can maximize the return on their efforts.

Researchers have been surveyed about sharing data many times this decade, but not about data publication (Harley et al., 2010; Westra, 2010; Tenopir et al., 2011; Kim & Stanton, 2012; Scaramozzino et al., 2012; Williams, 2013; Bobrow et al., 2014; Strasser et al., 2014). The RIN report of 2008 is the most recent survey to ask researchers about data publication; while its conclusions are undeniably valuable, it uses the term broadly enough that it is difficult to make any distinction between attitudes towards sharing data and publishing it (Swan & Brown, 2008). Consequently, open questions abound: What would a researcher expect data publication to mean? What about peer review of a dataset? Do current models satisfy those expectations? What potential features of a data publication would be useful for evaluating the quality of the data? For evaluating the contribution of the creator(s)? To get this critical perspective, we conducted an online survey of active researchers' perceptions of data publication.

Methods

Ethics statement

All results were drawn from a survey approved by the University of California, Berkeley Committee for Protection of Human Subjects/Office for the Protection of Human Subjects (protocol ID 2013-11-5841). Respondents completed the survey anonymously. Researchers affiliated with the University of California (UC) could supply an email address for follow-up assistance with data publication, but neither the fact of affiliation nor any UC-specific information was used in this analysis.

Survey design

The survey contained 34 questions in three categories: demographics, data sharing interest and experience, and data publication perceptions. Demographic questions collected information on respondents’ country, type of institution, research role, and discipline. Questions to assess respondents’ existing knowledge of data sharing and publication focused on several relevant US governmental policies and included an invitation to name data journals. Questions about data publication perceptions consisted of “mark all that apply” questions concerning definitions of data publication and peer review and Likert-scale questions about the value of various possible features of a data publication. The number of required questions was kept to a minimum. Some questions were displayed dynamically based on previous answers. Consequently, n varies considerably from question to question.

The survey was administered as a Google Form, officially open from January 22 to February 28 of 2014; two late responses received in March were included in the analysis. Solicitations were distributed via social media (Twitter, Facebook, Google+), emails to listservs, and a blog post on Data Pub.

Data processing and analysis

Although the topic of the survey is benign and identification would be unlikely to negatively impact respondents, light anonymization was performed prior to analysis and release of the response data. UC affiliation and answers to UC-specific questions were redacted. Respondent locations were grouped into United States and “other.” Questions that related to US policies were analyzed based on US respondents only; one question about the NIH was analyzed based only on US biologists. Sub-disciplines with fewer than three respondents were re-coded with the corresponding discipline. Listed data journal names were standardized manually, and free text answers to other questions were replaced with “other.” Because few questions were required, “mark all that apply” questions with no reply at all were considered to be skipped.
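
The anonymization steps above were applied before the released data file was produced, so they do not appear in this notebook. The sketch below only illustrates the kinds of transformations involved; the column names, journal-name map, and helper function are hypothetical, not those used on the raw survey file.

In [ ]:
# Illustrative sketch only: the raw (pre-anonymization) survey file is not
# distributed, and the column names and mapping here are hypothetical.
import numpy as np
import pandas as pd

JOURNAL_NAME_MAP = {'Sci Data': 'Scientific Data', 'ESSD': 'Earth System Science Data'}

def anonymize(raw):
    anon = raw.drop(['uc_affiliation', 'uc_email'], axis=1)         # redact UC-specific fields
    anon['country'] = np.where(anon['country'] == 'United States',  # group locations into US / "other"
                               'United States', 'Other')
    anon['data_journals'] = anon['data_journals'].replace(JOURNAL_NAME_MAP)  # standardize journal names
    return anon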

After anonymization, responses were filtered for analysis. Because the goal of the survey was to learn about researchers as distinct from the scholarly communication community, we excluded from analysis anyone who self-identified as a librarian or information scientist. To restrict the analysis to active researchers, anyone who affirmed that they had not generated any data in the last five years was excluded; the 90 respondents who did not answer this question were retained. Finally, respondents without at least a Bachelor's degree were filtered out. In total, 32 respondents were removed before analysis, some by multiple criteria.

Statistical significance was tested using Fisher’s exact test where possible (i.e., for 2x2 tables) and contingency $\chi^{2}$ in all other cases (Agresti, 1992). A statistical significance cutoff of $\alpha=0.05$ was used. When testing for, e.g., effects of discipline or prior experience, each answer choice was tested separately, and then the Bonferroni correction for multiple hypothesis testing was applied to adjust $\alpha$ for that question; this is a conservative approach that may not detect subtle differences. Mathematicians were omitted from $\chi^{2}$ significance testing for effects of discipline because their low $n$ led to unacceptably small expected count numbers. Odds Ratios (ORs) (Szumilas, 2010) and Fisher’s exact test p-values were used to assess the relationships between items depicted in Figure 5; to enable meaningful comparison, all ORs were calculated in the direction that yielded a value $\geq1$.
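
A minimal sketch of this per-question testing scheme is given below; it is illustrative only (the function and variable names are ours, not the code used to produce the published statistics).

In [ ]:
# Minimal sketch of the per-question testing scheme described above
# (illustrative only, not the code behind the published statistics).
import pandas as pd
import scipy.stats as sps

def test_answers_by_group(answer_columns, groups, alpha=0.05):
    """answer_columns: DataFrame of bools, one column per answer choice;
    groups: Series of group labels (e.g., discipline) for the same respondents."""
    corrected_alpha = alpha / len(answer_columns.columns)  # Bonferroni correction per question
    results = {}
    for answer in answer_columns.columns:
        table = pd.crosstab(groups, answer_columns[answer])
        if table.shape == (2, 2):
            statistic, p = sps.fisher_exact(table.values)            # statistic is the odds ratio
        else:
            statistic, p, dof, expected = sps.chi2_contingency(table.values)
        results[answer] = {'statistic': statistic, 'p': p,
                           'significant': p < corrected_alpha}
    return corrected_alpha, pd.DataFrame(results).T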

The robustness of all significant results was confirmed using a jackknife procedure in which the test was repeated systematically with each respondent removed; in no cases did the absence of a single respondent raise the p-value above the designated significance threshold. Error bars represent 95% basic bootstrap confidence intervals (10,000 resamples) of the mean response in Figure 5 and the percentage of positive responses in all other figures. To facilitate comparison, $\chi^{2}$ effects were measured as Cramér’s V ($\Phi_{C}$) (Cramér, 1999).
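
The jackknife check and the Cramér's V calculation can be sketched as follows; again, this is a hedged illustration with names of our own choosing, not the code used for the published figures.

In [ ]:
# Illustrative sketches of the jackknife robustness check and Cramer's V
# effect size described above (not the code used for the published figures).
import numpy as np
import pandas as pd
import scipy.stats as sps

def cramers_v(table):
    """Cramer's V for a contingency table given as a 2-d array or DataFrame."""
    table = np.asarray(table)
    chi2, p, dof, expected = sps.chi2_contingency(table)
    return np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))

def jackknife_max_p(answer_column, groups):
    """Largest chi-square p-value obtained when each respondent is dropped in turn."""
    p_values = []
    for respondent in answer_column.index:
        table = pd.crosstab(groups.drop(respondent), answer_column.drop(respondent))
        p_values.append(sps.chi2_contingency(table.values)[1])
    return max(p_values)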

Configuration


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import pylab
import scipy as sp
import scipy.stats as sps
import seaborn as sns

from IPython.display import display
from math import sqrt
from textwrap import wrap
pylab.rcParams['figure.figsize'] = (14.0, 6.0)

Utilities


In [ ]:
def expand_checkbox(checkbox_column, options):
    """
    checkbox_column = a series of lists of boxes checked in response to a question
    options = a list of the options available to check
    
    expand checkbox_column into a DataFrame of bools with index= respondents, columns= options
    """
    return pd.DataFrame({option : checkbox_column.apply(lambda x: option in x) for option in options})

def tuple_normalize(counts, n_responses):
    """
    counts (tuple of ints): tuple of counts
    n_responses (int): total number of respondents
    
    convert a pair (or more) of counts to percentages
    """
    return [float(x) / n_responses * 100 for x in counts]
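
A quick illustration of these utilities on made-up data (not survey responses):

In [ ]:
# Toy example: three respondents checked subsets of the options A, B, and C.
demo = pd.Series([['A', 'C'], ['A'], ['B', 'C']])
display(expand_checkbox(demo, ['A', 'B', 'C']))
print(list(tuple_normalize((2, 1), 4)))  # counts of 2 and 1 out of 4 respondents -> [50.0, 25.0]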

In [ ]:
def interval_to_error(confidence_interval, center):
    """
    confidence_interval (tuple): low, high relative to the origin
    center (int, float): measured value (e.g., mean)
    returns the ci as a tuple relative to the center (i.e., minus, plus the mean)
    """
    return tuple(map(lambda x: abs(float(x) - center), confidence_interval))
    
def split_interval(interval):
    """
    split a confidence interval tuple and return as a 2-element Series
    """
    return pd.Series(interval, index= ['low','high'])

Graphing functions

Graph Styling


In [ ]:
#set seaborn style
sns.set_style("white", 
              {'font.sans-serif': ['Helvetica', 'Liberation Sans', 
                                   'Bitstream Vera Sans', 'sans-serif'],
               'axes.linewidth': 0,
               'xtick.direction': 'in',
               'xtick.major.size': 8.0})

# max characters per line in graph labels
LABEL_WIDTH = 25
# color of x-axis
AXIS_COLOR = '#808080'
# color of checkbox bar graph bars
BAR_COLOR = '#08519c'
# color of confidence intervals
INTERVAL_COLOR = '#949494'

In [ ]:
def apply_cdl_style(fig, axis_color=AXIS_COLOR):
    fig.set_ylabel('')
    sns.despine(ax=fig, left=True)
    # get rid of weird dashed line
    fig.lines[0].set_visible(False) 
    
    #set font sizes
    fig.tick_params(axis='x', width=2, labelsize=14, color=axis_color)
    fig.tick_params(axis='y', labelsize=16)
    
    return fig

Bootstrap functions to generate confidence intervals


In [ ]:
def bootstrap_percentile_ci(data, n_samples=100000, alpha=0.05, stat_function=np.sum):
    """
    Calculates a confidence interval for True/False count data and returns a tuple (low, high) 
    This is a straightforward percentile calculation
    
    data (numpy array): of bools to resample
    n_samples (int): number of times to resample
    alpha (float): 1 - desired confidence interval (e.g., 0.05 for 95%)
    
    returns a tuple (low, high)
    """
    n_responses = len(data)
    # get n_samples resampled index arrays (each of length n_responses) into data
    indices = np.random.randint(0, n_responses, (n_samples, n_responses))
    
    # compute the desired stat for each resampled array, then sort
    stats = [stat_function(data[x]) for x in indices]
    stats.sort()

    # return stats at the edge of the 2.5 and 97.5 percentiles
    return (stats[int((alpha/2.0)*n_samples)], stats[int((1-alpha/2.0)*n_samples)])


def bootstrap_basic_ci(data, n_samples=100000, alpha=0.05, stat_function=np.sum):
    """
    Calculates a confidence interval for True/False count data and returns a tuple (low, high)
    Calls bootstrap_percentile_ci and converts the result to a basic bootstrap interval
    
    data (numpy array): of bools to resample
    n_samples (int): number of times to resample
    alpha (float): 1 - desired confidence interval (e.g., 0.05 for 95%)
    
    returns a tuple (low, high)
    """
    double_observed = 2 * stat_function(data)
    percentile_low, percentile_high = bootstrap_percentile_ci(data, n_samples=n_samples,
                                                               alpha=alpha, stat_function=stat_function)

    # basic (reverse percentile) bootstrap: reflect the percentile interval around the observed value
    return (double_observed - percentile_high, double_observed - percentile_low)
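
A quick sanity check of the basic bootstrap on made-up data (40 of 100 simulated respondents answering True); the returned interval should bracket the observed count of 40:

In [ ]:
# Sanity check on made-up data: the 95% basic bootstrap CI for the count of
# True answers should bracket the observed count of 40.
demo_data = np.array([True] * 40 + [False] * 60)
print(bootstrap_basic_ci(demo_data, n_samples=10000))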

Graph answers to check box "check all that apply" questions.


In [ ]:
def graph_checkbox(question, answers, bar_color=BAR_COLOR, interval_color=INTERVAL_COLOR):
    split_checkbox = responses[question].dropna()

    # checkbox_responses== DataFrame of bools where index=individual respondents, columns=answer choices
    checkbox_responses = expand_checkbox(split_checkbox, answers)
    
    # sum checked boxes in each column; response_counts== Series with values=sums, index=answer choices
    response_counts = checkbox_responses.sum()
    
    # resample and sum from each column to bootstrap a confidence interval
    # count_confidence_intervals== Series with values= tuples (low, high), index=answer choices
    count_confidence_intervals = checkbox_responses.apply(lambda x: bootstrap_basic_ci(np.array(x)))
    
    #print checkbox_responses.apply(lambda x: bootstrap_percentile_ci(np.array(x))).apply(tuple_normalize, args=([len(checkbox_responses)]))
    #print checkbox_responses.apply(lambda x: bootstrap_basic_ci(np.array(x))).apply(tuple_normalize, args=([len(checkbox_responses)]))
    
    #normalize response_counts to percentage of total respondents to the question and sort
    response_counts = response_counts.apply(lambda x: float(x) / len(checkbox_responses) * 100)
    response_counts.sort(ascending=True)
    
    #normalize confidence intervals to percentages and sort 
    count_confidence_intervals = count_confidence_intervals.apply(tuple_normalize, args=([len(checkbox_responses)]))
    count_confidence_intervals = count_confidence_intervals.reindex(index=response_counts.index)
    
    
    #convert absolute interval values to distance below and above the observed value
    for index in count_confidence_intervals.index.values:
        count_confidence_intervals.loc[index] = interval_to_error(count_confidence_intervals.loc[index], 
                                                                  response_counts.loc[index])
    
    #split interval tuples into 2 element Series
    count_confidence_intervals = count_confidence_intervals.apply(split_interval)
    
    response_counts.index = [ '\n'.join(wrap(i, LABEL_WIDTH)) for i in response_counts.index ]

    fig = response_counts.plot(kind='barh', color=bar_color, edgecolor='w', 
                               grid=False, xlim=(0,100), fontsize=14)
    
    fig.errorbar(response_counts.as_matrix(), np.arange(len(response_counts)), 
                 xerr=count_confidence_intervals.T.as_matrix(),
                 fmt='none', ecolor=interval_color, alpha=0.65, elinewidth=2, capsize=12, capthick=2)
    
    
    apply_cdl_style(fig)

    fig.get_figure().set_size_inches(14., 2. * len(response_counts.index))    
    return fig, response_counts

Graph interrelationships between check box "check all that apply" answers


In [ ]:
def graph_fisher_exact(question, answers, alpha=0.05, labels=None, edge_color=BAR_COLOR):
    
    checkbox_responses = expand_checkbox(question.dropna(), answers)

    fig=nx.Graph()
    fig.add_nodes_from(checkbox_responses.columns)
    pos = nx.circular_layout(fig, scale=10000)

    nx.draw_networkx_nodes(fig, pos, node_size=5000, node_color='#cdcdcd', linewidths=0)
    nx.draw_networkx_labels(fig, pos, labels=labels, font_size=10, font_family='serif')
    
    i = 0
    for a in checkbox_responses.columns:
        i += 1
        for b in checkbox_responses.columns[i:]:
            # 2x2 contingency table of answers a and b
            square = pd.DataFrame(index=[True, False], columns=[True, False])

            # fill in counts; reindex so both True and False rows exist (missing counts become 0)
            square[True] = (checkbox_responses[checkbox_responses[a] == True][b]
                            .value_counts().reindex([True, False]).fillna(0))
            square[False] = (checkbox_responses[checkbox_responses[a] == False][b]
                             .value_counts().reindex([True, False]).fillna(0))

            odds_ratio, p = sps.fisher_exact(square.as_matrix())

            e_color = edge_color
            if odds_ratio < 1:
                # negative association: invert the odds ratio so edge width reflects
                # strength in either direction (transposing the table would leave it unchanged)
                odds_ratio = 1.0 / odds_ratio
                e_color = 'r'

            how_significant = 0.5 if p > alpha else 1.0

            nx.draw_networkx_edges(fig, pos, edgelist=[(a, b)], width=odds_ratio,
                                   edge_color=e_color, alpha=how_significant)

    plt.axis('off')
    return fig

Graph answers to Likert-scale (i.e., rate from 1 to 5) questions


In [ ]:
def graph_likert(questions, answers, filter_on_column=None, filter_value=None, 
                 interval_color=INTERVAL_COLOR):
    
    if filter_on_column:
        responses_ft = responses[responses[filter_on_column] == filter_value]
    else:
        responses_ft = responses
    
    collected_counts = pd.DataFrame(index=answers)
    stats = pd.DataFrame(index=questions,columns=['mean', 'ci'])

    # set up dict for conversion from likert scale (e.g., 1-5) to 0-100%
    number_of_answers = len(answers) 
    answer_to_value = dict(zip(answers, np.arange(number_of_answers)/float(number_of_answers - 1)*100)) 
    
    for column in questions:
        collected_counts[column] = responses_ft[column].value_counts().dropna()
       
        #scale responses (from the filtered set) to go from 0 to 100
        likert_values = responses_ft[column].dropna().map(answer_to_value)
        
        #calculate mean and 95% confidence interval
        stats['mean'].loc[column] = likert_values.mean() 
        stats['ci'].loc[column] = bootstrap_basic_ci(np.array(likert_values), stat_function=np.mean)
        
    #sort stats and collected_counts by the mean   
    stats = stats.sort_index(axis=0, by='mean', ascending=True)
    collected_counts = collected_counts.T.reindex(index=stats.index)
    collected_counts = collected_counts.div(collected_counts.sum(1).astype(float)/100, axis = 0) 
    
    #convert absolute interval values to distance below and above the observed value
    for index in stats.index.values:
        stats['ci'].loc[index] = interval_to_error(stats['ci'].loc[index], stats['mean'].loc[index])

    #split interval tuples into 2 element Series
    stats['ci'] = stats['ci'].apply(split_interval)
    
    collected_counts.index = [ '\n'.join(wrap(i, LABEL_WIDTH)) for i in collected_counts.index ]
   
    #plot percentages of each response
    fig = collected_counts.plot(kind='barh', stacked=True, grid=False, 
                                color=sns.color_palette("Blues", len(collected_counts.columns)),
                                xlim = (0,100), edgecolor='w', linewidth=2) 
    
    # plot mean and 95% confidence interval
    fig.plot(stats['mean'], np.arange(len(stats)), marker='o', color='w',axes=fig, 
             markersize=25, markeredgewidth=0, linewidth=0)
    
    fig.errorbar(stats['mean'].as_matrix(), np.arange(len(stats)), xerr=stats['ci'],
                 fmt='none', ecolor=interval_color, alpha=0.65, elinewidth=2, capsize=12, capthick=2)
    
    fig.legend(bbox_to_anchor=(0., -0.02, 1., -0.03), loc='upper left', ncol=number_of_answers, mode="expand",
                    borderaxespad=0., fontsize=14)
    
    apply_cdl_style(fig)
    
    fig.get_figure().set_size_inches(14., 2. * len(collected_counts.index))
    
    return fig

Read in and filter data

To limit analysis to active researchers, filter out librarians, undergraduates, and respondents who said that they had not generated any data in the last 5 years.


In [ ]:
EXCLUDE = {'role' : 'Librarian', 'discipline' : 'Information science', 
           'highest_degree' : 'Highschool', 'generated_data' : 'No'}

responses = pd.read_csv('DataPubSurvey_anon.csv')

for column, value in EXCLUDE.items():
    responses = responses[responses[column] != value]

Consolidate fine-grained sub-discipline answers into larger disciplines.


In [ ]:
DISCIPLINE_MAP = {'Anthropology' : 'Social science',
                  'Archaeology' : 'Archaeology',
                  'Area studies' : 'Social science',
                  'Economics' : 'Social science',
                  'Political science' : 'Social science',
                  'Psychology' : 'Social science',
                  'Sociology' : 'Social science',
                  'Astronomy' : 'Space science',
                  'Astrophysics' : 'Space science',
                  'Environmental Science' : 'Environmental science',
                  'Geology' : 'Earth science',
                  'Oceanography' : 'Environmental science',
                  'Planetary science' : 'Earth science',
                  'Biochemistry' : 'Biology',
                  'Bioinformatics' : 'Biology',
                  'Biology' : 'Biology',
                  'Evolutionary Biology' : 'Biology',
                  'Neurobiology' : 'Biology',
                  'Social science' : 'Social science',
                  'Space science' : 'Space science',
                  'Earth science' : 'Earth science',
                  'Life science' : 'Biology',
                  'Chemistry' : 'Physical science',
                  'Physics' : 'Physical science',
                  'Computer science' : 'Computer science',
                  'Mathematics' : 'Mathematics',
                  'Information science' : 'Information science',
                  'Other' : 'Other'}

responses.discipline= responses.discipline.map(DISCIPLINE_MAP).dropna()

Results

Demographics

Table 1. Demographic breakdown of the 249 researchers whose responses are analyzed here.


In [ ]:
DEMOGRAPHICS = ['discipline', 'highest_degree', 'role','institution']

for column in DEMOGRAPHICS:
    count = responses[column].value_counts()
    percentages = 100 * count.apply(lambda x: float(x) / count.sum())
    display(pd.DataFrame([count, percentages], index=['count', 'percent']).T)

We collected responses to an online survey of data publication practices and perceptions in January and February of 2014 and received 281 unique responses. Because we distributed the survey solicitation via social media and email lists and did not contact most recipients directly, we cannot estimate with any accuracy how many researchers received the solicitation or calculate a response rate. Our analysis was restricted to the 249 (89%) respondents whom we deemed to be active researchers (described in Table 1). Researchers from 20 countries responded, but most were affiliated with institutions in the US (79%, $n=197$). The institutions were largely academic (85%, $n=204$); 94% ($n=191$) of those were focused on research rather than teaching. By discipline, the largest response was from biologists (37%), followed by archaeologists (13%), social scientists (13%), and environmental scientists (11%). We heard from researchers across the academic career spectrum: 41% ($n=102$) were principal investigators/lab heads, 24% ($n=61$) postdocs, and 16% ($n=41$) grad students. We saw few significant differences in responses between disciplines or roles, so we have presented the results in aggregate. For significance testing, we consolidated subdisciplines into 8 high-level disciplines. Given the number of respondents, this survey should have 80% power to detect small effects by chi square ($\chi^{2}$) test (size $\Phi_{C}=0.17$) and 95% power to detect medium-small effects ($\Phi_{C}=0.22$) (Cohen, 1988). For a breakdown of data by discipline or role, see the full raw dataset– redacted only to preserve anonymity– published in the University of California's Merritt repository.
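
The quoted power figures can be roughly checked with a noncentral chi-square calculation. The sketch below assumes a two-group comparison (df = 1), for which Cohen's w equals Cramér's V; it illustrates the calculation and is not the original power analysis.

In [ ]:
# Rough power check (not the original power analysis): chi-square test power via
# the noncentral chi-square distribution. For tables with min(rows, columns) = 2,
# Cohen's w equals Cramer's V; df = 1 is an assumption here.
from scipy.optimize import brentq

def chi2_power(w, n, df, alpha=0.05):
    """Power of a chi-square test for effect size w with n observations."""
    critical_value = sps.chi2.ppf(1 - alpha, df)
    return 1 - sps.ncx2.cdf(critical_value, df, n * w ** 2)

def detectable_effect_size(power, n, df, alpha=0.05):
    """Smallest effect size w detectable with the given power."""
    return brentq(lambda w: chi2_power(w, n, df, alpha) - power, 1e-6, 1.0)

# e.g., detectable_effect_size(0.80, 249, 1) and detectable_effect_size(0.95, 249, 1)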

Background Knowledge


Figure 1. Researchers are generally unfamiliar with data-related funder policies.

How familiar are you with each of these policies?

In [ ]:
AWARENESS_QUESTIONS = ['aware_ostp_policy', 'aware_nsf_dmp', 'aware_nih_data_sharing_policy']
AWARENESS_ANSWERS = ["Never heard of it", "Heard of it", "Read about it", "Know all the details"]  

graph_likert(AWARENESS_QUESTIONS, AWARENESS_ANSWERS, filter_on_column='united_states', filter_value=True)

Researchers are generally unfamiliar with data-related funder policies. Respondents based at US institutions self-reported their familiarity with three government funder policies: the White House OSTP Open Data Initiative ($n=197$), NSF Data Management Plan requirements ($n=197$), and the NIH data sharing policy (only biologists included, $n=76$). White dots show the mean familiarity for each item; error bars depict bootstrapped 95% confidence intervals.


We asked a number of questions to assess engagement and familiarity with data sharing and publication (Figure 1). Respondents rated their familiarity with three US federal government policies related to data sharing and availability. Because these policies are specific to the US, we restricted this part of our analysis to respondents who work there. Respondents were most familiar with the National Science Foundation (NSF)'s Data Management Plan requirement (NSF, 2011). Fewer than half had heard of the United States Office of Science and Technology Policy (OSTP) Open Data Initiative (Obama, 2013). Although the directive will eventually affect virtually all researchers who receive US government funding, awareness is most likely low because concrete policies have not been implemented yet. The much older National Institutes of Health (NIH) data sharing policy (NIH, 2003) was enacted 11 years ago, but only four biologists (5%) claimed to know all the details, fewer than the 18 (24%) who had never heard of it.

The recent rapid proliferation of data journals led us to ask about them specifically. A free text box was provided for respondents to list any data journals that they could name. Only 40 respondents (16%) named any data journals. Ecological Archives was the most frequently named, by 16 respondents. The second most frequent response was Nature Publishing Group's Scientific Data (named by 14), even though it had not started publishing at the time of the survey. Earth System Science Data (ESSD) ($n=7$), Biodiversity Data Journal ($n=6$) and Geoscience Data Journal ($n=5$) followed. A number of respondents listed non-journal data publishers: figshare ($n=6$), Dryad ($n=3$), and Zenodo ($n=1$).

Data sharing mechanisms

Data publication is a relatively new and unfamiliar concept to researchers, but most do have experience with and opinions about data sharing, and we explored those briefly before moving on to publication. A majority of respondents (56%, $n=140$) said that it is very important to share the data that underlies a study; differences between disciplines were not statistically meaningful ($\chi^{2}=39.1$, $p= 0.18$). Most had experience sharing their data (68%, $n=168$) or reusing another researcher's shared data (61%, $n=151$). Of the researchers who shared, 58% ($n=98$) saw their data reused by someone, and 62% ($n=61$) of those reuse instances led to a published paper. Most of the respondents who reused data published a paper with it (69%, $n=104$).


Figure 2. Researchers primarily share data in response to direct contact (e.g. via email).

Figure 2A. How have you shared your data?

In [ ]:
SHARING_CHANNELS = ["Email / direct contact", "Personal or lab website", "Journal website (as supplemental material)", 
                    "Database or repository"]
graph_checkbox('how_shared', SHARING_CHANNELS)
Figure 2B. How have others obtained your data?

In [ ]:
graph_checkbox('how_others_got', SHARING_CHANNELS)
Figure 2C. How have you obtained others' data?

In [ ]:
graph_checkbox('how_you_got', SHARING_CHANNELS)
Figure 2D. What accompanied your shared data?

In [ ]:
HOW_DOCUMENTED_ANSWERS = ["A traditional research paper based on the data (with analysis and conclusions)",
                          "A data paper describing the data (without analysis or conclusions)",
                          "Informal text describing the data",
                          "Formal metadata describing the data (e.g. as XML)",
                          "Computer code used to process or generate the data",
                          "Shared with no additional documentation"]

graph_checkbox('how_documented', HOW_DOCUMENTED_ANSWERS)

Researchers primarily share data in response to direct contact (e.g. via email). Respondents who shared data indicated (A.) the channels they used to share their data, (B.) the channels others used to obtain the data, and (D.) how they documented the data. (C.) Respondents who used others' data indicated the channels through which they obtained the data. Error bars depict bootstrapped 95% confidence intervals.


Because some, but not all, means of sharing data satisfy the availability requirements of data publication (Kratz & Strasser, 2014), we asked researchers with data sharing experience about the mode of transmission (Figure 2A-C). The supplied answer choices were the four methods for external data sharing that emerged in interviews by Kim and Stanton (2012): email/direct contact, personal website, journal website, and database or repository. Email/direct contact was the most frequently reported method for sharing: 87% ($n=146$) of the respondents who shared data did so directly, 82% ($n=82$) were aware of other researchers obtaining their data directly, and 57% ($n=86$) of the respondents who reused data obtained it directly. The predominance of direct contact is probably in part an artifact of awareness– respondents necessarily know when they give someone their data directly, but they may not be notified when someone downloads it from a repository or website. Eight respondents (5%) wrote in that they had obtained data through a channel we had not considered: extracting data from the text, tables, or figures of a published paper.

Credit for sharing data


Figure 3. Formal citation is the preferred method of crediting dataset creators.

Figure 3A. How should a dataset creator be credited?

In [ ]:
HOW_CREDITED_ANSWERS = ["Authorship on paper",
                        "Acknowledgement in the paper",
                        "Data cited in the reference list",
                        "Data cited informally in the text of the paper"]

graph_checkbox('data_sharing_credit', HOW_CREDITED_ANSWERS)
Figure 3B. If you published on someone else's dataset, how did you credit the dataset creator(s)?

In [ ]:
graph_checkbox('how_you_credited', HOW_CREDITED_ANSWERS)

Formal citation is the preferred method of crediting dataset creators. Respondents indicated (A.) how a dataset creator should be credited, (B.) how they actually credited a dataset creator in the past, and (C.) how satisfied they were with the credit they received the last time someone else published using their data. (A., B.) Respondents could select more than one item for each question. Error bars depict bootstrapped 95% confidence intervals.


Rewarding data creators is a primary goal of data publication, so we asked how a dataset creator should be credited by a reuser (Figure 3A.). The most common answer, from 83% ($n=126$) of respondents, was formal citation in the reference list. Acknowledgment also ranked highly, at 62% ($n=93$). Most (30 out of 34) respondents who gave a free-text answer wrote some variant on "it depends," often citing one of two factors: the publication status of the dataset (e.g. "depends on whether the data is already published") and the role of the data in the paper (e.g. "authorship if data is [the] primary source of analysis, otherwise acknowledgment"). Because previous studies reported differences in citation practices between disciplines (Swan & Brown, 2008; Harley et al., 2010; Tenopir et al., 2011), we tested whether different disciplines responded differently (omitting Mathematics because the $n$ was too low to reliably test). Following the approach of Tenopir et al. (2011), we performed separate $\chi^{2}$ tests for each of the four provided answer choices. Using a significance cutoff corrected for multiple hypothesis testing of $\alpha= 0.05/4 = 0.0125$, we did not detect a difference between disciplines ($\chi^{2}\leq 16.4$, $p\geq 0.022$).

We also asked respondents who had published with shared data how they actually credited the creator. Reported practice fit well with theory: formal citation was the most popular method (63%, $n=81$), followed by acknowledgment (50%, $n=70$). A notable distinction was that while a few respondents (16%, $n=24$) said that it was appropriate to cite data informally in the body of the text, none admitted to actually doing it (Figure 3B.).

Many researchers fear that shared data might be used by "data vultures" who contribute little and don't acknowledge the source (Kim & Stanton, 2012). To assess how realistic this fear is, we asked respondents whose data had been reused for a publication whether they felt adequately credited. Most (63%, $n=54$) felt satisfied, and a combined 78% ($n=67$) felt the credit was appropriate or excessive. This left 22% ($n=13$) who were unsatisfied; only 2 (2%) felt that the credit was "very insufficient." These differences in satisfaction could derive from different attitudes toward appropriate credit. To test this, we collapsed responses into three categories (insufficient, appropriate, and excessive) and tested for independence against each of the four provided answers to the data sharing credit question (Figure 3A), but none of the relationships were statistically meaningful (corrected $\alpha= 0.0125$, $\chi^{2}\leq 3.26$, $p\geq 0.20$).

Expected features of data publication and peer review


Figure 4. Researcher expectations of data publication center on availability, not peer review.

Figure 4A. How would you expect a published dataset to differ from a shared one?

In [ ]:
DP_FEATURES = ["Openly available without contacting the author(s)",
               "Deposited in a database or repository",
               "Assigned a unique identifier such as a DOI",
               "A traditional research paper is based on the data",
               "A data paper (without conclusions) describes the data",
               "Packaged with a thorough description of the data",
               "Packaged with formal metadata describing the data (e.g. as XML)",
               "Dataset is \"peer reviewed\""]
fig = graph_checkbox('publish_definition', DP_FEATURES)
Figure 4B. What would you expect data peer review to consider?

In [ ]:
PR_FEATURES = ["Collection and processing methods were evaluated",
               "Descriptive text is thorough enough to use or replicate the dataset",
               "Necessary metadata is standardized (e.g. in XML)",
               "Technical details have been checked (e.g. no missing files no missing values)",
               "Plausibility considered based on area expertise",
               "Novelty/impact considered"]

fig = graph_checkbox('peer_review_definition', PR_FEATURES)

Researcher expectations of data publication center on availability, not peer review. Respondents conveyed the expectations raised by the terms (A.) publication and (B.) peer review in the context of data. Respondents could select more than one item for each question. Error bars depict bootstrapped 95% confidence intervals.


The central question we hoped to answer is what "data publication" and "data peer review" actually mean to researchers. We decomposed the prevalent models of data publication into a set of potential features and asked respondents to select all the features that would distinguish a "published" dataset from a "shared" one (Figure 4A). The most prevalent expectations relate to access: 68% ($n=166$) expect a published dataset to be openly available and 54% ($n=133$) expect it to be in a repository or database. Substantially more researchers expected a published dataset to be accompanied by a traditional publication (43%, $n=105$) than by a data paper (22%, $n=55$). Only a minority (29%, $n=70$) expected published data to have been peer-reviewed.

Much of the prestige of scholarly publication derives from surviving the peer review process. It is natural, then, that many data publication initiatives model their validation process on peer review and employ the term for its prestige and familiarity. However, it is not obvious exactly how literature peer review processes and criteria should be adapted for data, or what guarantees data peer review should make. We asked what researchers expect from data peer review, providing a selection of considerations that data reviewers might take into account (Figure 4B). The most common responses sidestepped examination of the data itself; 90% ($n=220$) of respondents expected evaluation of the methods and 80% ($n=196$) of the documentation. There was little (22%, $n=53$) expectation that data reviewers would consider novelty or potential impact.

We tested for differences in expectations of both data publication and peer review among disciplines and between research roles. No significant differences between roles emerged. The only two significant differences among disciplines related to structured metadata: discipline had a significant effect on expectation of formal metadata in the publication process (corrected $\alpha= 0.006$, $\chi^{2}= 33.0$, $p= 2.6\times10^{-5}$) and consideration of standardized metadata in peer review (corrected $\alpha= 0.008$, $\chi^{2}= 26.7$, $p= 3.8\times10^{-4}$). In both cases, the most notable distinctions were the expectations of a large fraction of environmental scientists: 63% (compared to 25% in the population as a whole) for publication and 73% (compared to 39%) for peer review. This popularity among environmental scientists may be driven by the use of a mature metadata standard in the field, Ecological Metadata Language (EML) (Blankman & McGann, 2003).


Figure 5. Researchers have coherent expectations of data publication and peer review.

Figure 5A. How would you expect a published dataset to differ from a shared one?

In [ ]:
dp_labels = {"Openly available without contacting the author(s)" : "openly\navailable",
             "Deposited in a database or repository" : "repository\ndeposit",
             "Assigned a unique identifier such as a DOI" : "unique\nID",
              "A traditional research paper is based on the data" : "traditional\npaper",
              "A data paper (without conclusions) describes the data" : "data\npaper",
              "Packaged with a thorough description of the data" : "thorough\nmetadata",
              "Packaged with formal metadata describing the data (e.g. as XML)" : "formal\nmetadata",
              "Dataset is \"peer reviewed\"" : "peer\nreview"}

graph_fisher_exact(responses.publish_definition, DP_FEATURES, labels=dp_labels, alpha=0.05/28)
Figure 5B. What would you expect data peer review to consider?

In [ ]:
pr_labels = {"Collection and processing methods were evaluated" : "methods\nappropriate",
               "Descriptive text is thorough enough to use or replicate the dataset" : "thorough\nmetadata",
               "Necessary metadata is standardized (e.g. in XML)" : "standard\nmetadata",
               "Technical details have been checked (e.g. no missing files no missing values)" : "technical\ndetails",
               "Plausibility considered based on area expertise" : "data\nplausible",
               "Novelty/impact considered" : "novelty/\nimpact"}

graph_fisher_exact(responses.peer_review_definition, PR_FEATURES, labels=pr_labels, alpha=0.05/15)

Researchers have coherent expectations of data publication and peer review. Graph of relationships between the researcher expectations shown in Figure 4. Nodes are potential (A.) publication features or (B.) peer review assessments. Edge width is proportional to relationship strength as measured by odds ratio. Blue edges show positive relationships, red are negative. Dark edges are significant at the $\alpha=0.05$ level by Fisher's exact test, with correction for multiple hypothesis testing to (A.) $\alpha=0.0018$ and (B.) $\alpha=0.0033$.


To learn whether respondents selected data publication features or peer-review assessments independently or as coherent constellations of ideas, we performed Fisher's exact tests of independence between every pair of features (Figure 5). Within data publication, we found a dense set of statistically significant associations among items related to access and preservation (at the $\alpha=0.05$ level, corrected to $\alpha=0.0018$). For example, repository deposit was linked to open availability ($OR=9.55$, $p=7.6\times10^{-14}$), assignment of a unique identifier ($OR= 3.33$, $p= 1.33\times10^{-5}$), and both formal ($OR= 4.49$, $p= 3.94\times10^{-6}$) and rich ($OR= 3.83$, $p=1.14\times10^{-6}$) metadata. Formal and rich metadata were themselves linked ($OR= 12.5$, $p=2.7\times10^{-14}$). Formal metadata was also linked to assignment of a unique identifier, which is sensible in that an identifier is meaningless without metadata ($OR= 7.92$, $p= 5.05\times10^{-11}$). Another carrier for metadata, a data paper, was linked to both rich ($OR=3.64$, $p=4.22\times10^{-5}$) and formal ($OR=4.30$, $p=1.32\times10^{-5}$) metadata. Data papers were the only item associated with peer review ($OR=3.00$, $p=0.0011$). Traditional papers had no significant associations at all; the closest was with data paper ($OR=1.93$, $p=0.023$).

Potential considerations during peer review were also significantly linked (with a corrected cutoff of $\alpha=0.0033$). Three assessments were strongly interlinked: from appropriate methods to standardized metadata ($OR=7.91$, $p=0.00081$) to technical evaluation ($OR=4.56$, $p=3.4\times10^{-6}$) and back ($OR=5.9$, $p=7.9\times10^{-5}$). Plausibility correlated with other factors that require domain expertise: appropriate methods ($OR=4.51$, $p=0.0014$), adequate documentation ($OR=4.87$, $p=2.5\times10^{-6}$), and novelty/impact ($OR=5.50$, $p=1.1\times10^{-5}$). Plausibility was the only association for novelty/impact.

Valued data-publication features


Figure 6. Researchers trust and value peer review highly.

Figure 6A. Which of the following evaluations have you made?

In [ ]:
REVIEW_ACTIONS = ["reviewed a journal article",
                  "reviewed a grant proposal",
                  "reviewed an application to graduate school",
                  "reviewed a CV to hire someone for your lab",
                  "served on a hiring committee",
                  "served on a tenure & promotions committee"]
fig = graph_checkbox('researcher_review_experience', REVIEW_ACTIONS)
Figure 6B. How much confidence in a dataset does each attribute inspire?

In [ ]:
DATA_TRUST = ['traditional_paper_confidence', 'data_paper_confidence', 'peer_review_confidence', 'reuse_confidence']
DATA_TRUST_SEQUENCE = ["No confidence", "Little confidence", "Some confidence", "High confidence", "Complete confidence"]

graph_likert(DATA_TRUST,DATA_TRUST_SEQUENCE)
Figure 6C. How useful is each metric in assessing dataset value/impact?

In [ ]:
DATA_IMPACT = ['impact_citation', 'impact_downloads', 'impact_altmetrics', 'impact_google_rank']
DATA_IMPACT_SEQUENCE = ["Not at all useful", "Slightly useful", "Somewhat useful", "Highly useful", "Extremely useful"]

graph_likert(DATA_IMPACT, DATA_IMPACT_SEQUENCE)
Figure 6D. How much weight would you give each item on a researcher's CV?

In [ ]:
PUBLICATION_VALUE = ["traditional_paper_value", "data_paper_pr_value", "data_paper_npr_value", "dataset_pr_value", "dataset_npr_value"]
PUBLICATION_VALUE_SEQUENCE = ["None", "A small amount", "Some", "Significant", "A great deal"]

graph_likert(PUBLICATION_VALUE, PUBLICATION_VALUE_SEQUENCE)

Researchers trust and value peer review highly. (A.) Respondents reported their past experience evaluating other researchers in each context; respondents could select more than one item. Respondents reported (B.) how much trust each data publication feature inspires, (C.) how useful each metric would be for assessing impact, and (D.) how valuable a CV item each kind of data publication would be. White dots show the mean response for each item; error bars depict bootstrapped 95% confidence intervals.


Validation of published data facilitates use only if potential users trust the means of assessment. To learn which means researchers trust, we presented respondents with four possible features and asked them to rate how much confidence each would confer (Figure 6B). All four inspired at least some confidence in most researchers (ranging from 89% to 98%). Respondents trusted peer review above all else: 72% ($n=175$) said it conferred high or complete confidence and only 2% ($n=4$) would feel little or no confidence. The second most trusted indicator was knowledge that a traditional paper had been published with the data; 56% ($n=137$) would have high or complete confidence. Reuse of the data by a third party came in third, with 43% ($n=106$). Description by a data paper was the least convincing at 37% ($n=89$) high or complete confidence, although reuse inspired little or no confidence in more respondents (11%, $n=25$).

Beyond reuse, data publication should reward researchers who create useful datasets with credit. To that end, we asked what metrics researchers would most respect when evaluating a dataset's impact (Figure 6C). Respondents considered number of citations to be the most useful metric; 49% ($n=119$) found citation count highly or extremely useful. Unexpectedly, a substantial 32% ($n=77$) felt the same way about number of downloads. The distinction between citation and download counts shrinks to 9% if the comparison is made at the level of at least somewhat useful (82% versus 73%). Only a minority of respondents considered search rank (42%, $n=102$) or altmetrics (37%, $n=91$) to be even somewhat useful.

Even before quality or impact enter consideration, the prestige associated with publishing a dataset is influenced by its format. We distilled the multiplicity of data publication formats into four generic models– with or without a data paper and with or without peer review– and asked respondents how much each would contribute to a researcher's curriculum vitae (Figure 6D). As a point of comparison, respondents also rated the value of a traditional paper; 60% ($n=145$) would give one a great deal of weight and another 36% ($n=87$) would give it significant weight. The most valuable data publication model was data published with a peer-reviewed data paper, but even that was only given a great deal of weight by 10% ($n=23$), although another 46% ($n=109$) gave it significant weight. A peer-reviewed dataset with no paper dropped to 5% ($n=12$) giving a great deal of weight, while an un-peer-reviewed data paper dropped to 1% ($n=2$). Thus, peer review outweighed having a data paper as a factor. A substantial 27% ($n=65$) would award an un-peer-reviewed dataset no weight at all. For this question, which explicitly addressed evaluation of dataset creators, we were particularly interested in the 26% ($n=59$) of survey respondents who had experience on a tenure and promotions committee. We compared their responses to each feature with those who had not served on a committee by $\chi^{2}$, but found no significant relationships (corrected $\alpha= 0.01$, $\chi^{2}\leq 8.39$, $p\geq 0.078$).

Discussion

Demographics, statistical power, and bias

Although this survey was international in scope, most of the respondents were affiliated with institutions in the United States. The respondents here (84% North American) resemble those of the DataONE survey (Tenopir et al., 2011) (73% North American); many of the previous surveys were conducted entirely in the US (Ceci, 1988; Kim & Stanton, 2012; Scaramozzino et al., 2012). The in-depth reports prepared by EAGDA (Bobrow et al., 2014) and the RIN (Swan & Brown, 2008) were carried out in the UK, where a single assessment framework, the REF, dominates, creating a significantly different environment in terms of credit. The bulk of our responses (85%) came from academic institutions, which is similar to DataONE's 81% (Tenopir et al., 2011).
Ceci's initial survey was academic (Ceci, 1988), and Scaramozzino's was conducted entirely at a single teaching-oriented university (Scaramozzino et al., 2012). In this respect, the population here is quite comparable to previous surveys.

Researchers in all of the major academic roles and a variety of disciplines responded. In terms of role, our respondents again resemble those of the DataONE survey: there, 47% were professors and 13.5% graduate students; here, 41% were principal investigators and 16% graduate students (Tenopir et al., 2011). Most other surveys were restricted to principal investigators; one exception, the EAGDA survey, still heard mostly (69%) from principal investigators (Bobrow et al., 2014). Our largest response was from biologists (37%), followed by archaeologists (13%), social scientists (13%), and environmental scientists (11%). DataONE heard mostly from researchers in its area of focus, environmental sciences and ecology (36%), followed by social science (16%) and biology (14%) (Tenopir et al., 2011). Scaramozzino's survey included a high proportion of physicists and mathematicians, but 18% of respondents were biologists (Scaramozzino et al., 2012). The EAGDA survey was heaviest in biomedical fields such as epidemiology (26.8%) and genetics/genomics (20%), but also featured 31.4% social scientists (Bobrow et al., 2014).

Whereas DataONE uncovered statistically distinct data sharing attitudes between respondents in different disciplines, we did not. The effect sizes observed in tables 21 and 22 of Tenopir et al. (2011)– which most closely parallel the questions about appropriate credit for sharing data presented here– range from $\Phi_{C}=0.11$ to $0.17$. This survey should have 80% sensitivity to an effect size at the top of this range, $\Phi_{C}=0.17$, so we find it plausible that the detection of statistically meaningful distinctions in one survey and not the other could be an artifact of the difference in statistical power (from an $n$ of $1329$ vs. $249$) rather than a reflection of real differences in the respondent populations. However, $0.17$ is comfortably a "small" effect, so we are unlikely to have missed large or even moderate effects by chance (Cohen, 1988).
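
The power figure above can be approximated with a noncentral chi-squared calculation: convert $\Phi_{C}$ to Cohen's $w$, use $n w^{2}$ as the noncentrality parameter, and evaluate the tail beyond the usual critical value. The sketch below is a back-of-the-envelope version of that calculation, not the analysis behind the reported figure; the contingency-table dimensions are an assumption, and the estimate shifts with them.

```python
# Rough power estimate for detecting Cramer's V (Phi_C) = 0.17 at n = 249 with a
# chi-squared test of independence, under assumed table dimensions.
from scipy.stats import chi2, ncx2


def chi2_power(cramers_v: float, n: int, rows: int, cols: int, alpha: float = 0.05) -> float:
    """Power of a chi-squared independence test for a given Cramer's V and sample size."""
    df = (rows - 1) * (cols - 1)
    w = cramers_v * (min(rows, cols) - 1) ** 0.5  # convert Cramer's V to Cohen's w
    ncp = n * w ** 2                              # noncentrality parameter
    crit = chi2.ppf(1 - alpha, df)                # critical value under the null
    return 1 - ncx2.cdf(crit, df, ncp)            # P(reject | true effect size w)


# Example (table shape assumed): chi2_power(0.17, 249, rows=3, cols=4)
# The estimate varies with the assumed shape of the contingency table.
```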

As is the case for many of the previous surveys, participation was voluntary and open, so our sample may be biased toward researchers with an interest in data sharing and publication. However, a high proportion of respondents (84%) did not name any data journals, especially relative to the 40% of EAGDA respondents who were unfamiliar with the format (Bobrow et al., 2014). That and the low awareness of US federal policies (e.g., 35% of US respondents had never heard of the NSF data management plan requirement and 6% had never heard of the OSTP Open Data Initiative) suggest that our respondents are not atypically invested in these issues.

Data publication

The RIN report of 2008 concluded that "...'publishing' datasets means different things to different researchers" (Swan & Brown, 2008) and we found that little has changed. Even the most frequently named defining feature in this survey, open availability, was only chosen by $\sim$2/3 of respondents. One respondent simply wrote "terms are confusing." However, the emergence of systematic relationships between some of the features demonstrates that the responses were not utterly confused. We observed two, arguably three, independent concepts of data publication.

The most widely held concept centers on present and future access. Open availability tightly correlates with the second most frequent feature, repository deposit. Repository deposit correlates with three other conceptually related features (unique identification, rich metadata, and formal metadata), and numerous interconnections unite all five of these features. This concept of publication maps well onto virtually all present data publication implementations, including lightweight approaches like figshare and Zenodo.

The second concept lingers from the pre-digital days of scholarly communication: published data is data that has been used or described in a traditional journal article. Nearly half (43%) of the respondents chose "basis of a research paper" as a defining feature of data publication. Surprisingly, this and the previous concept did not compete, but were instead almost completely independent. The traditional-paper concept reflects how researchers speak (e.g. to "publish an experiment" is to publish a research paper that uses the experiment), but does not match the conversation in the scholarly communication community, where data that has been used or described but not made available would not be considered published and, conversely, data that has been made available but never used in a research paper might be. This mismatch is a potential source of misunderstanding that the scholarly communication community should be aware of.

The third concept, not entirely independent from the first, is that a published dataset is one that has been described by a data paper. Data papers correlate with peer review and both kinds of metadata, but not with features related to the disposition of the data (e.g. open availability or repository deposit), even though virtually all data paper publishers require repository deposit. Data papers conferred less trust than any other feature, but only by a small margin: 37% of respondents derived high or complete confidence from a data paper, compared to 43% from successful reuse. Respondents regarded data papers as much less valuable than traditional research papers: 60% would give a traditional paper a great deal of weight, but only 10% would value a data paper that highly. Only 16% had been able to name a data journal at the start of the survey, and data papers may come to be valued more as awareness spreads; one respondent wrote "I've never heard of this, but it sounds fantastic." Alternatively, research communities may conclude that data papers should be valued less. Already, 55% of respondents gave a peer-reviewed data paper significant (or higher) value, and that may ultimately be appropriate. Data papers clearly add perceived value to a dataset, but not as much as peer review.

Validating published data

Quality control via peer review is integral to traditional scholarly publication, so it is no surprise that, in reference to data publication, the RIN noted "[t]here is, for some, also an implication that the information has been through a quality control process" (Swan & Brown, 2008). Even in regard to novel material like data, researchers trust the traditional scholarly publication process: our respondents trusted peer review and use in a research paper more than any other indicators of quality. However, less than half expected published data to have been used in a published research paper, and only one third expected it to have been peer reviewed. We conclude, with the RIN, that researchers do not have a clear idea of what quality control to expect from published data. In this uncertainty, the research and scholarly communication communities are in perfect agreement. How, and how extensively, to assess data quality is the least settled of the many open questions surrounding data publication, and different initiatives take a variety of approaches, including collecting user feedback, distinct technical and scientific review, and closely modeling literature peer review (Kratz & Strasser, 2014).

Peer review establishes the trustworthiness of a dataset and elevates its perceived value more than any other factor in this survey. Despite one respondent's remark that "I have never heard this term applied to a dataset and I don't know what it means," expectations of peer review were more consistent than those of publication. Whereas only 68% of respondents selected even the most popular feature in the question on data publication, 90% agreed that they expect data peer review to include evaluation of collection and processing methods. In fact, half of the possible peer review assessments were selected by more than 68% of respondents.

Unsurprisingly, a majority of respondents expect assessments that require domain expertise, i.e. that peer review involve review by peers in their field. Assessment of plausibility was linked with three other assessments that require domain expertise: method evaluation, adequacy of metadata for replication, and potential novelty/impact. The high (80%) expectation that peer review of data includes peer review of its documentation/metadata suggests that researchers are aware of the critical importance of documentation for data reuse and replication. That, and the low (22%) expectation that peer review consider novelty/impact, are in line with current data journal peer review processes and guidelines (Kratz & Strasser, 2014). However, our survey question focused on the aspects of a data publication that might be assessed, not on the review process, and peer review expectations might be satisfied through any number of pre- or post-publication processes. We conclude that models of data publication without peer review are unlikely to confuse researchers, but that peer review greatly enhances both reuse and reward. Furthermore, assessment processes that at least meet the expectations of peer review will be critical for data publications to attain a status at all comparable to that of journal articles.

The idea that "data use in its own right provides a form of review" (Parsons et al., 2010) is frequently expressed in the conversation around data publication. Reuse could be documented through citations from research papers to the dataset or through direct feedback from researchers who used the data. Based on past experiences, we were surprised that successful reuse did not inspire more trust; both peer review and use as the basis of a traditional paper inspired more confidence than third-party reuse. It is worth noting that serving as the basis of a research paper by the dataset creator is itself evidence of successful use, just not by a third party. However, respondents did consider citations to be the most useful metric for assessing value/impact. This apparent contradiction could result from evaluating trustworthiness and impact differently, or from different concepts of "successful" reuse and reuse that results in a citation. The combined value of enhancing trust and establishing impact makes tracking dataset citations eminently worthwhile, but still no substitute for peer review.

Credit for publishing data

The scholarly communication community agrees that data should be cited formally in the reference list (Joint Declaration of Data Citation Principles), but this is rarely done in practice (Sieber & Trumbo, 1995; Mooney, 2011; Mooney & Newton, 2012). In a 1995 survey of 198 papers that used published social science datasets, 19% cited the dataset with at least the title in the reference list (Sieber & Trumbo, 1995). A follow-up 17 years later found only 17% of papers meeting even this low standard, showing that practice has not improved (Mooney & Newton, 2012). The most common actual approach is informal citation in the methods or results section of the paper; 30.8% of papers in 1995 and 69.2% in 2012 included the dataset title somewhere in the text. Notwithstanding this dismal state of practice, researchers agree that the correct approach is formal citation: 95% of respondents to DataONE said that formal citation was a fair condition for data sharing, 87% of astrobiologists said the same, and 71% of biodiversity researchers said they would like their data to be cited "in the references like normal publications" (Tenopir et al., 2011; Aydinoglu et al., 2014; Enke et al., 2012). Here too, formal citation was the most popular response both to how a dataset creator should be credited and to how the respondent actually credited data creators. No respondents admitted to citing data informally in the text. This apparent disconnect between what is observed in the social science literature and self-reported practice could arise in any of a number of ways: it may be that social science is not a representative discipline, that occasions when respondents cited data formally are easier to bring to mind, or that researchers define dataset reuse differently than the authors of the literature surveys. For instance, a biologist who uses a sequence from GenBank and mentions the accession number in the methods section of the paper might not think of that activity as data reuse warranting a formal citation. Beyond notions of credit, formal data citations are useful to the 71% of respondents to the EAGDA survey who already track use of their datasets "through details of publications generated using the data" (Bobrow et al., 2014). We conclude that researchers are aware of the benefits of formal data citation and suggest that data citation efforts focus on implementation rather than persuasion.

While respondents deemed citation the most useful metric of dataset value, they also attached high value to download counts. These preferences align with the practices reported in the EAGDA survey, where 43% of respondents tracked downloads of their datasets (Bobrow et al., 2014). In the present scholarly communication infrastructure, repositories can count downloads much more easily than citations; citations are preferable, but downloads are the "low hanging fruit" of data metrics. In comparison to download counts, appreciation of altmetrics (e.g. mentions in social media or the popular press) was low: only one third of respondents found them even somewhat useful in assessing impact. Altmetrics for research articles are still being developed, so it is not surprising that researchers are unsure what they might signify for data. For data publishers, there is certainly no harm in providing altmetrics-- and a majority of respondents did find them at least slightly useful-- but they are unlikely to have much impact in the short term.

Researchers see the time required as the biggest cost of data sharing, but the risk they most fear is that "data vultures" will strip the data for publications without adequately acknowledging the creator(s) (Kim & Stanton, 2012). To learn how well-founded these fears are, we asked respondents how satisfied they were with the credit they received the last time someone published using their data. The majority felt that the credit was appropriate, but the fraction who felt shortchanged (22%) is too large to ignore, and we must conclude that this dissatisfaction is a real problem. Whether the problem is ultimately with the way dataset creators are credited or with the way dataset creators expect to be credited is for research communities to decide. We can say that satisfied and dissatisfied respondents did not differ significantly in how they thought dataset creators should be credited, so the variability in satisfaction was most likely driven by variability in the credit received rather than in the respondents' expectations. As data publication takes shape, the problem can be reduced by solidification of community norms around data use, increased prestige for dataset creators, and better adoption of formal data citation.

Practical conclusions

The results of this survey offer some practical guidance for data publishers seeking to meet researcher expectations and enhance the value of datasets. Above all else, researchers expect published data to be accessible, generally through a database or repository; this fits well with current practice and, indeed, with the idea of publication at its most fundamental. The research and scholarly communication communities agree that formal citation is the way to credit a dataset creator, and a number of steps can be taken to encourage this practice. Data publishers should enable formal citation (e.g. by assigning persistent identifiers and specifying a preferred citation format), and article publishers should encourage authors to cite data formally in the reference list. Data publishers should track and aggregate citations to their datasets to the extent feasible; at a minimum, they should publicize download counts, which are less valued by researchers but easier to implement. Data papers enhance dataset value, but much of the value of a peer-reviewed data paper can be obtained by peer review alone. Researchers do not regard peer review as integral to data publication, but it remains the gold standard of both trustworthiness and prestige. Repositories and databases can make data more useful to both creators and users by incorporating peer review, whether by managing the process themselves or by integrating with peer-reviewed data journals. While many aspects of data peer review are unresolved, two clear expectations that should be met are that true peers will supply domain expertise and that evaluation of metadata will play a significant role.